Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve usability of Directory datatype #17614

Merged
merged 17 commits into from
Oct 30, 2024

Conversation

wm75
Copy link
Contributor

@wm75 wm75 commented Mar 6, 2024

This adds functionality to the Directory datatype class, which can now be displayed and downloaded as an archive.
It also adds a new archive_to_directory converter that generalizes the existing tar_to_directory one to work with tar and zip archives. Also updates the older converter's requirement to an existing version of the galaxy-util package. Previously the exact requirement wasn't installable via conda.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. upload an archive as either zip or tar
    2. convert the datatype to directory
    3. explore

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 24.1 milestone Mar 6, 2024
@wm75
Copy link
Contributor Author

wm75 commented Mar 6, 2024

@davelopez step 1 for zarr datatype integration

@wm75 wm75 requested review from mvdbeek and bernt-matthias March 6, 2024 09:55
@wm75
Copy link
Contributor Author

wm75 commented Mar 6, 2024

@astrovsky01 could maybe be an interesting alternative to your colabfold tar archive?

Copy link
Contributor

@bernt-matthias bernt-matthias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have some unit tests for setting metadata. Maybe here https://github.com/galaxyproject/galaxy/blob/dev/test/unit/data/datatypes/test_data.py.

There should be a bit of inspiration for such tests in the folder.

lib/galaxy/config/sample/datatypes_conf.xml.sample Outdated Show resolved Hide resolved
lib/galaxy/datatypes/data.py Outdated Show resolved Hide resolved
lib/galaxy/datatypes/data.py Show resolved Hide resolved
lib/galaxy/datatypes/data.py Outdated Show resolved Hide resolved
@wm75
Copy link
Contributor Author

wm75 commented Mar 6, 2024

Ok, your opinion about the display_data part is a bit diappointing for me. I've spent a considerable amount of time looking at the existing code for the Data datatype class and actually thought that I had implemented the display part in the spirit of that parent class' code, but seems you don't think so.

Unfortunately, I don't understand, in particular, your distinction between the datatype's and the client's concern:
Data.display_data itself implements a poor man's directory display for the case that a link like datasets/ddaca2bad6847b13/display/dataset_22960784-8945-48d8-86fb-46d3c00a8b3e points to a filename in extra_files_path, which is actually a directory. In the case that the link leads to a file the same method will serve that file. So depending on its input params that method is making lots of decisions and can return very different things.
My proposed subclass method leaves almost all of this untouched, but special-cases one more situation, which is links like datasets/ddaca2bad6847b13/preview/. For this particular case, it will use the parent class' already existing directory display to show the contents of the root folder, which I don't think is a very far-fetched preview for a directory datatype.

Now for potential subclasses of Directory that may define index files, the display_data method will display that index file's content as a preview instead.

Data.display_data, in its docstring, also has the warning: Datatypes should be very careful if overriding this method and this interface between datatypes and Galaxy will likely change. so I thought it'd be better to implement enough flexibility in Directory.display_data so that subclasses won't immediately have to override the method again, but instead there's one hopefully carefully reviewed place where the magic happens. That's why I handle the index file case in the Directory class even though I agree with you that it's of no relevance for that class itself, but it's also logical to think that a number of directory subclasses will have something to use as an index file.

In general, there is no urgency here, and I do not intend to get into any heated argument over this. I'm willing to adjust the code and learn about your opinion and the reason behind it. Above are my reasons for implementing this first version like I did, and all I can say is that I gave this quite some thought, but never expected my first attempt to be perfect nor even close to it.

@mvdbeek
Copy link
Member

mvdbeek commented Mar 7, 2024

I should say that our datatype code is quite sub-optimal in so many places, in part due to us not being very strict with reviewing them (for a good reason, we want to collect all those domain-specific datatypes), and having had to resort to server-side templating for the longest time in Galaxy's history. That in turn means that it's not always appropriate to just model new code on existing code, especially if you're working on important datatypes that we have to build on going forward. Now I don't mind altering stuff in display_data, we can always remove that later, but I don't think we should add more data to the database that we're not using.

itself implements a poor man's directory display for the case

All of these should eventually go away, just like the "bam-to-sam-to-tabular" display. You can produce a listing of extra files via the API, which is what a directory browser visualization should use to implement a browsable interface. We shouldn't have to stick that into the database. Making the visualizations first-class is a priority on the roadmap and we're not far away from getting there.

Datatypes should be very careful if overriding this method and this interface between datatypes and Galaxy will likely change.

I would take this comment quite verbatim to mean that the interface can change, in terms of function signature or expected return values. That comment likely comes from an era when we still had datatypes on the tool shed. I don't think that the number of subclasses matters, and in my previous comment I suggested that you can implement a directory-style subclass and a concrete implementation that uses the additional data you want to store in the database, which is my number one issue with adding unused metadata elements. And then we'll also see how all that is actually used.

What's here is great, could you not just break away the extra metadata elements and add them as a different parent class for your zarr datatype ?

@mvdbeek mvdbeek removed this from the 24.1 milestone May 14, 2024
@mvdbeek mvdbeek self-requested a review May 14, 2024 14:29
@bgruening
Copy link
Member

The Biohackathon is coming up soon again. What is the status here? My understanding is that we need this for the upcoming Zarr datatype and Zarr Visualisation?

@davelopez davelopez marked this pull request as draft October 9, 2024 08:14
@davelopez davelopez force-pushed the archive-to-dir-converter branch 2 times, most recently from 687d71e to aebb357 Compare October 9, 2024 13:10
@davelopez davelopez marked this pull request as ready for review October 9, 2024 13:18
@github-actions github-actions bot added this to the 24.2 milestone Oct 9, 2024
@davelopez davelopez mentioned this pull request Oct 9, 2024
4 tasks
@davelopez davelopez force-pushed the archive-to-dir-converter branch 2 times, most recently from 0d5123b to af45cd1 Compare October 10, 2024 10:49
@davelopez davelopez marked this pull request as draft October 11, 2024 08:34
@davelopez davelopez force-pushed the archive-to-dir-converter branch from aa91fbd to ee24ca3 Compare October 11, 2024 08:42
@davelopez davelopez marked this pull request as ready for review October 11, 2024 14:37
@davelopez davelopez force-pushed the archive-to-dir-converter branch from b9df77a to 6f0713b Compare October 14, 2024 07:34
@@ -1212,6 +1212,18 @@ def regex_line_dataprovider(
class Directory(Data):
"""Class representing a directory of files."""

file_ext = "directory"

def _archive_main_file(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for roundtripping a directory via the API ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can try!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test, but I'm not sure why I'm getting ModuleNotFoundError: No module named 'galaxy' when running the converter tool... using the UI seems to work fine 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the API tests are setting up dependency resolution. I am mostly interested in verifying that the structure of the tar archive is the same pre-and-post upload. In fact I think even the checksum should match if we're not compressing the archive. In either case if you upload a tar file and download it again that should be sufficient. The converter is tested in the test framework.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I see, so you mean something like this fc7e959#diff-a6ab1700bcef9e1585a2bb0f84e8888470a770fb81c3e0337930e7cad573093fR662

I'll do that 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried this:

    def test_fetch_directory(self, history_id):
        testdir = TestDataResolver().get_filename("testdir.tar")
        with open(testdir, "rb") as fh:
            details = self._upload_and_get_details(
                fh, api="fetch", history_id=history_id, ext="directory", assert_ok=True
            )
        assert details["file_ext"] == "directory"
        assert details["file_size"] == 3584
        content = self.dataset_populator.get_history_dataset_content(
            history_id, dataset=details, to_ext="directory", type="bytes"
        )
        dir_path = decompress_bytes_to_directory(cast(bytes, content))
        assert dir_path.endswith("testdir")
        for path, entry_class in EXPECTED_CONTENTS.items():
            path = os.path.join(dir_path, os.path.pardir, path)
            if entry_class == "Directory":
                assert os.path.isdir(path)
            else:
                assert os.path.isfile(path)

But if I don't run the converter manually instead of the to_ext="directory" the extra_files_path is empty, I guess that is why you have more changes persisting extra files in the object store in your referenced branch dev...mvdbeek:galaxy:directory_datatype_improvements#diff-8640d91ef47bca302b00039012979f4b1b79f5dbffbe2431bc9a05f19fb4c7d0R132

Should we merge your branch instead? Is something still missing in your branch or should that be how to do it?

Sorry, I'm a bit lost 😅

Copy link
Contributor

@davelopez davelopez Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mvdbeek, re-reading your comment:

In either case if you upload a tar file and download it again that should be sufficient.

do you mean something simpler like this instead?

def test_upload_tar_roundtrip(self, history_id):
        testdir = TestDataResolver().get_filename("testdir.tar")
        expected_hash = md5_hash_file(testdir)
        expected_size = os.path.getsize(testdir)
        with open(testdir, "rb") as fh:
            details = self._upload_and_get_details(fh, api="fetch", history_id=history_id, assert_ok=True)
        assert details["file_ext"] == "tar"
        assert details["file_size"] == expected_size
        content = cast(
            bytes, self.dataset_populator.get_history_dataset_content(history_id, dataset=details, type="bytes")
        )
        assert len(content) == expected_size
        dir_path = decompress_bytes_to_directory(content)
        expected_contents = {
            "testdir": "Directory",
            "testdir/c": "Directory",
            "testdir/a": "File",
            "testdir/b": "File",
            "testdir/c/d": "File",
        }
        assert dir_path.endswith("testdir")
        for path, entry_class in expected_contents.items():
            path = os.path.join(dir_path, os.path.pardir, path)
            if entry_class == "Directory":
                assert os.path.isdir(path)
            else:
                assert os.path.isfile(path)

        with tempfile.NamedTemporaryFile("wb") as temp:
            temp.write(content)
            actual_hash = md5_hash_file(temp.name)
            assert actual_hash == expected_hash

If this is what you mean, the uploaded vs downloaded tar size and contents match, but the hashes don't (not sure why).
I still don't see the connection with the directory datatype or the converter changes in this PR, so I might be misunderstanding something 😞

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've overwritten _archive_main_file, as a result you've made sure you're not getting an (empty) extra file added to the archive.

test_upload_tar_roundtrip looks fine to me, you sure you don't need to flush and that that is why the checksums don't match ? You could also skip the hash and just compare the bytes. If the contents are the same, no added or removed files or structure then it's all good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh that was it... I was missing the flush 🙈

Thank you very much!!

@davelopez davelopez marked this pull request as draft October 14, 2024 11:59
@davelopez davelopez mentioned this pull request Oct 22, 2024
4 tasks
@davelopez davelopez requested a review from mvdbeek October 29, 2024 08:46
@davelopez davelopez force-pushed the archive-to-dir-converter branch from a9d4a60 to 9ef9a43 Compare October 30, 2024 15:59
@davelopez davelopez marked this pull request as ready for review October 30, 2024 17:15
@bgruening
Copy link
Member

Thanks everyone!

@bgruening bgruening merged commit c6a10d6 into galaxyproject:dev Oct 30, 2024
55 checks passed
Copy link

This PR was merged without a "kind/" label, please correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants